Regression analysis

The procedure for using the sample correlation to derive the F-test

image-20200724095612281

image-20200724095709817

image-20200724095748636

image-20200724095839261

image-20200724100254921
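Since the actual derivation lives in the screenshots above, here is the key identity for quick reference (standard for straight-line regression with an intercept, where $r$ is the sample correlation between $x$ and $y$):

$$
R^2 = r^2 = \frac{RSS_{H_0} - RSS}{RSS_{H_0}}, \qquad
F = \frac{(RSS_{H_0} - RSS)/1}{RSS/(n-2)} = \frac{(n-2)\, r^2}{1 - r^2} \sim F_{1,\, n-2} \quad \text{under } H_0: \beta_1 = 0 .
$$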

Consider multivariate regression:

First, we know from the straight-line case that:

image-20200724100759557

This actually follows from image-20200724101057390

Since:

image-20200724101119199

image-20200724101227326

image-20200724101343958

image-20200724101426603

(Note: there is a typo in the last equality of this formula.)

Example problem:

image-20200801114722402

image-20200801114804087

Method of Orthogonal Projections

image-20200724112434521


Method of Lagrange Multipliers

image-20200724112535112

Regression with orthogonal predictor variables

image-20200821163434067

image-20200821163555801

Regression hypothesis testing

image-20200731170414678

Matrix algebra cheatsheet

image-20200803121119795

image-20200724112836837

image-20200724112905348

image-20200724112932964

Diagnostic tools and model selection

image-20200724115351328

image-20200724115619202

image-20200724115736215

image-20200724120914749

image-20200724121158962

A high leverage point can mask the evidence of an outlier.

image-20200724143822312

image-20200724161616216

Some conclusions for straight line regression

image-20200801123251530

Straight line prediction interval and band

image-20200801124028820

image-20200801124506253

Introducing new explanatory variables

image-20200724144120271

The key step in proving this theorem is to orthogonalize the model.
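My own sketch of what the orthogonalization step looks like, assuming the partitioned model $Y = X_1\beta_1 + X_2\beta_2 + e$ and writing $P_1 = X_1(X_1'X_1)^{-1}X_1'$ (the exact statement of the theorem is in the screenshot below):

$$
Y = X_1\beta_1 + X_2\beta_2 + e
  = X_1\bigl(\beta_1 + (X_1'X_1)^{-1}X_1'X_2\,\beta_2\bigr) + (I - P_1)X_2\,\beta_2 + e ,
$$

so the new regressor $(I - P_1)X_2$ is orthogonal to $X_1$ while the coefficient of interest $\beta_2$ is unchanged.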

image-20200724145003721

image-20200731154849741

image-20200731154856114

Generalized least squares

image-20200724161429976

High leverage points

image-20200724161912695

DFFITS

image-20200731171804285

Cook’s distance

image-20200731171839367

Design matrix of less than full rank

image-20200724162107512

image-20200724162124714

There are two ways to handle this. One is to drop some variables, such as $\alpha_1$ and $\gamma_1$, since we cannot estimate that many parameters. The other is to expand the design matrix like this.

image-20200724162232272

Notice that we add two more observations to our data, and our goal is to enforce $\alpha_1+\alpha_2 = 0$ and $\beta_1+\beta_2 = 0$. The coefficients estimated from the expanded design matrix will satisfy the two constraints. Why does this work?

Intuitively, we now have $n+2$ data points, and we can do just as well as before on the first $n$ points while also satisfying the two constraints, because the error on the remaining two data points can be made exactly zero.

Put another way: previously you had to work out how to choose the values of $\alpha_1, \alpha_2, \ldots$ to make the error smallest; now we also want to minimize the error from the last two observations, and luckily we can solve for them explicitly and drive that error to 0.
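A minimal numerical sketch of this trick, using made-up data for a hypothetical 2×2 layout (only the constraint-row idea comes from the notes above):

```python
import numpy as np

# Columns: mu, alpha1, alpha2, beta1, beta2 -- only rank 3, so the LSE is not unique.
X = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
], dtype=float)
y = np.array([10.0, 12.0, 15.0, 17.0])

# Two extra pseudo-observations with response 0 encode the constraints.
C = np.array([
    [0, 1, 1, 0, 0],   # alpha1 + alpha2 = 0
    [0, 0, 0, 1, 1],   # beta1  + beta2  = 0
], dtype=float)
X_aug = np.vstack([X, C])
y_aug = np.concatenate([y, [0.0, 0.0]])

# The expanded design matrix has full column rank, so lstsq gives a unique fit
# whose coefficients satisfy both sum-to-zero constraints.
beta_hat, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
print(beta_hat)
print("alpha1 + alpha2 =", beta_hat[1] + beta_hat[2])  # approximately 0
print("beta1  + beta2  =", beta_hat[3] + beta_hat[4])  # approximately 0
```

Because the constraint directions can always be absorbed within the solution set of the original rank-deficient problem, the augmented fit attains the same RSS on the first $n$ rows and zero error on the two pseudo-observations.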

Multicollinearity in regression

Suppose you are fitting a regression model with two highly correlated explanatory variables. If you fit a model that includes both of them:

image-20200723120043056

We can see that neither term is significant. But if we fit them separately, we find:

image-20200723120224941

image-20200723120237788

Both terms now appear significant.

Intuitively, when we test whether a term is significant, the null hypothesis sets the coefficient of that variable to 0, and we refit the model to see whether it performs worse. If the reduced model performs very badly, we conclude that the variable is important. But in our situation the two variables are highly correlated: no matter which one you remove, the other can still explain the response quite well, so when both are included in the model, each term appears insignificant.
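A small simulation sketch of this effect (made-up data, not from the original notes): fitted jointly, neither slope looks significant; fitted separately, each one does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is almost a copy of x1
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

def ols_t_stats(Z, y):
    """OLS with an intercept; returns the t statistics of all coefficients."""
    X = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])            # estimate of sigma^2
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))    # standard errors
    return beta / se

print(ols_t_stats(np.column_stack([x1, x2]), y))  # both slopes: |t| small
print(ols_t_stats(x1, y))                         # x1 alone:   |t| large
print(ols_t_stats(x2, y))                         # x2 alone:   |t| large
```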

Assumptions on the error term

image-20200725231230718

Some properties:

  • Under A2, an unbiased estimator of $\sigma^2$ is $\|Y-X \widehat{\beta}\|^{2} /(n-p)$ (a short derivation is sketched after this list).

  • The LSE and the residual vector are always uncorrelated. image-20200726123518105

  • Under A3, image-20200726123633868

    image-20200726123817182

    image-20200726124323727

    image-20200726125305977

    image-20200726125413047
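A short derivation of the first bullet above, assuming $X$ has full column rank $p$, $P$ is the hat matrix, and $e = Y - X\beta$:

$$
\|Y - X\widehat{\beta}\|^{2} = Y'(I-P)Y = e'(I-P)e, \qquad
E\bigl[e'(I-P)e\bigr] = \sigma^{2}\,\operatorname{tr}(I-P) = \sigma^{2}(n-p),
$$

so $E\bigl[\|Y-X\widehat{\beta}\|^{2}/(n-p)\bigr] = \sigma^{2}$ under A2.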


Fisher information of the regression parameters

image-20200726125805019

image-20200726125910295

Weighted least squares estimators

image-20200726130031685

image-20200726130113223

image-20200726130244327

image-20200726130439241

image-20200726130836578

Gauss-Markov Theorem

image-20200725231943193

image-20200725232208118

image-20200725232728681

image-20200726121852175

Random effect model

image-20200726153913176

image-20200726153932656

image-20200726154628814

image-20200726154713970

The restricted maximum likelihood estimators

image-20200726171546822

image-20200726171609879

image-20200726171627277

image-20200726171645670

Qualifying exam 2019

image-20200731151925352

  • (a) tests a linear combination; use an F test or a t test. The numerator of the F test is $RSS_{H_0} - RSS$ and the denominator is $S^2$.
  • (b) requires knowing that $Y_{n+1}$ and $\hat{Y}$ are independent, so the variance is the sum of their individual variances (a sketch of this calculation follows the list).
  • (c): based on (b), we can derive the distribution of $Y_{n+1}$.
  • (d) is an "introducing new explanatory variables" problem; the residuals can be written using $(I_n - P)X$.
  • (e): the formula used is this one
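A sketch of the variance calculation behind (b), assuming the new response is $Y_{n+1} = x_{n+1}'\beta + e_{n+1}$ with $e_{n+1}$ independent of the errors used to fit $\hat{\beta}$:

$$
\operatorname{Var}\bigl(Y_{n+1} - x_{n+1}'\hat{\beta}\bigr)
= \operatorname{Var}(Y_{n+1}) + \operatorname{Var}\bigl(x_{n+1}'\hat{\beta}\bigr)
= \sigma^{2} + \sigma^{2} x_{n+1}'(X'X)^{-1}x_{n+1}
= \sigma^{2}\bigl(1 + x_{n+1}'(X'X)^{-1}x_{n+1}\bigr).
$$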

Qualifying exam 2018

image-20200731164305582

image-20200731171234105

  • (a) uses regression with orthogonal predictor variables.
  • (b) uses the expansion of $(Y-X\hat{\beta})'(Y-X\hat{\beta})$.
  • (c): refer to here.
  • (d) uses the conclusion of (a), since X1 and X2 are orthogonal.
  • H indicates leverage, the residual indicates outliers, and DFFITS indicates influential points.

image-20200806134726081

Qualifying exam 2017

image-20200731172320912

image-20200731172341033

  • (a): here we want to test $\beta_1 = 0$, and we only have the sample correlation information; we refer to here.

  • The basic idea is $r^2 = R^2 = \frac{RSS_{H_0}-RSS}{RSS_{H_0}}$.

  • (b): the basic steps of model diagnostics are checking independence, constant variance, and normality.

  • (e): how to compute the noncentrality parameter (ncp); a more general version is sketched after this list.

    If $Y \sim N(\mu,\sigma^2)$, then $(Y/\sigma)^2 \sim \chi^2_1(a)$ with $a = \mu^2/\sigma^2$.
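The more general normal-theory fact behind (e), stated here as a reminder (for a symmetric projection matrix $P$ of rank $r$):

$$
Y \sim N_n(\mu, \sigma^2 I_n)
\;\Longrightarrow\;
\frac{Y'PY}{\sigma^2} \sim \chi^2_{r}\!\left(\frac{\mu'P\mu}{\sigma^2}\right).
$$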

Qualifying exam 2016

image-20200801112848977

  • (f): if you do not want to redo the whole calculation, it is worth remembering image-20200801113130976

Qualifying exam 2012

image-20200802215858749

image-20200802215923477

image-20200802215953516

image-20200802220004742

  • (a): the columns of the design matrix are highly correlated, so under either hypothesis $\beta_1 = 0$ or $\beta_2 = 0$ the model is still essentially a perfect fit. Hence neither $\beta_1$ nor $\beta_2$ is significant, while $\beta_0$ is significant. To calculate the coefficients, just use the LSE.

  • (b) i. Due to multicollinearity.

Qualifying exam 2013

image-20200802220741583

image-20200802220747796

  • (a) The MLE of $\beta$ is the same as the LSE, and the MLE of $\sigma^2$ is $Y'NY / n$ (a quick sketch follows this list).
  • (c), (d): larger power means smaller variance of $\hat{\beta}$.
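A quick sketch of why the MLE of $\sigma^2$ divides by $n$, assuming normal errors and writing $N = I - P$ for the residual projection:

$$
\ell(\beta,\sigma^{2}) = -\tfrac{n}{2}\log(2\pi\sigma^{2}) - \frac{\|Y-X\beta\|^{2}}{2\sigma^{2}}
\;\Longrightarrow\;
\hat{\beta}_{\mathrm{MLE}} = \hat{\beta}_{\mathrm{LSE}},
\qquad
\hat{\sigma}^{2}_{\mathrm{MLE}} = \frac{\|Y-X\hat{\beta}\|^{2}}{n} = \frac{Y'NY}{n}.
$$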

image-20200802222103297

  • (c): use

image-20200802222214145

(e): To minimize $C(k,\lambda)$, we can first look at $C(k+1,\lambda)-C(k,\lambda)$ and keep increasing $k$ while this difference is negative. It turns out that $C(k+1,\lambda)-C(k,\lambda)$ is exactly $Z_i - \lambda$, so $\hat{k}$ will be a kind of change point.

Qualifying exam 2014

image-20200805230047939

image-20200805230106293

image-20200805230120784

image-20200805230133478

image-20200805230148250

image-20200805230202657

  • (a), (b): add the sum-to-zero constraints.
  • (c): Fisher's LSD test.
  • (d): use the F statistic built from $RSS_{H_0} - RSS$ and $RSS$ to test the main effects and also the interaction effect (the general form is written out after this list).
  • (e): no; the condition for using an F test is that the numerator chi-square random variable is independent of the denominator chi-square random variable. A nested pair such as trt and trt + bWeight can form an F test, since $RSS_{H_0} - RSS$ and $RSS$ yield an F distribution once divided by their degrees of freedom, but other pairs such as trt and bWeight cannot achieve this.
  • (g): it is a prediction for a new sample; the variance should be $\hat{\sigma}^{2}\, x(X'X)^{-1}x'$.
  • (h): check whether there is multicollinearity.
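For reference, the general form of the F statistic used in (d) and (e), assuming $q$ independent constraints under $H_0$ and $p$ columns in the full design matrix:

$$
F = \frac{(RSS_{H_0} - RSS)/q}{RSS/(n-p)} \sim F_{q,\, n-p} \quad \text{under } H_0 .
$$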

Qualifying exam 2015

image-20200807222939655

  • (a): constructing X here requires some tricks; the parameters need a reparameterization.

    First fix the first, second, and fourth columns, then solve for the third column so that it is orthogonal to all of the other columns.

    image-20200808142212377

  • (c): brute-force calculation; use the orthogonality property to cancel some terms.

QE sample 1

image-20200817000356393

QE sample 2

image-20200817000455307